{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Relation extraction using distant supervision: experiments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Bill MacCartney and Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Fall 2020\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [Overview](#Overview)\n",
    "1. [Set-up](#Set-up)\n",
    "1. [Building a classifier](#Building-a-classifier)\n",
    "  1. [Featurizers](#Featurizers)\n",
    "  1. [Experiments](#Experiments)\n",
    "1. [Analysis](#Analysis)\n",
    "  1. [Examining the trained models](#Examining-the-trained-models)\n",
    "  1. [Discovering new relation instances](#Discovering-new-relation-instances)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "OK, it's time to get (halfway) serious. Let's apply real machine learning to train a classifier on the training data, and see how it performs on the test data. We'll begin with one of the simplest machine learning setups: a bag-of-words feature representation, and a linear model trained using logistic regression.\n",
    "\n",
    "Just like we did in the unit on [supervised sentiment analysis](https://github.com/cgpotts/cs224u/blob/master/sst_02_hand_built_features.ipynb), we'll leverage the `sklearn` library, and we'll introduce functions for featurizing instances, training models, making predictions, and evaluating results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set-up\n",
    "\n",
    "See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "import os\n",
    "import rel_ext\n",
    "import utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set all the random seeds for reproducibility. Only the\n",
    "# system seed is relevant for this notebook.\n",
    "\n",
    "utils.fix_random_seeds()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "rel_ext_data_home = os.path.join('data', 'rel_ext_data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the following steps, we build up the dataset we'll use for experiments; it unites a corpus and a knowledge base in the way we described in [the previous notebook](rel_ext_01_task.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = rel_ext.Dataset(corpus, kb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code splits up our data in a way that supports experimentation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'tiny': Corpus with 3,474 examples; KB with 445 triples,\n",
       " 'train': Corpus with 249,003 examples; KB with 34,229 triples,\n",
       " 'dev': Corpus with 79,219 examples; KB with 11,210 triples,\n",
       " 'all': Corpus with 331,696 examples; KB with 45,884 triples}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "splits = dataset.build_splits()\n",
    "\n",
    "splits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Featurizers\n",
    "\n",
    "Featurizers are functions which define the feature representation for our model. The primary input to a featurizer will be the `KBTriple` for which we are generating features. But since our features will be derived from corpus examples containing the entities of the `KBTriple`, we must also pass in a reference to a `Corpus`. And in order to make it easy to combine different featurizers, we'll also pass in a feature counter to hold the results.\n",
    "\n",
    "Here's an implementation for a very simple bag-of-words featurizer. It finds all the corpus examples containing the two entities in the `KBTriple`, breaks the phrase appearing between the two entity mentions into words, and counts the words. Note that it makes no distinction between \"forward\" and \"reverse\" examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
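    "def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):\n",
    "    # \"Forward\" examples: the corpus lists the entities as (sbj, obj).\n",
    "    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word] += 1\n",
    "    # \"Reverse\" examples: same entities in the opposite order. Their\n",
    "    # words are added to the same counts as the forward examples.\n",
    "    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word] += 1\n",
    "    return feature_counter"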
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's how this featurizer works on a single example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KBTriple(rel='contains', sbj='Brickfields', obj='Kuala_Lumpur_Sentral_railway_station')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kbt = kb.kb_triples[0]\n",
    "\n",
    "kbt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'it was just a quick 10-minute walk to'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.get_examples_for_entities(kbt.sbj, kbt.obj)[0].middle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'it': 1,\n",
       "         'was': 1,\n",
       "         'just': 1,\n",
       "         'a': 1,\n",
       "         'quick': 1,\n",
       "         '10-minute': 1,\n",
       "         'walk': 1,\n",
       "         'to': 2,\n",
       "         'the': 1})"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "simple_bag_of_words_featurizer(kb.kb_triples[0], corpus, Counter())"
   ]
  },
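  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The featurizer above pools \"forward\" and \"reverse\" examples into one set of counts. As a minimal sketch of how an additional featurizer might keep the two directions apart, the cell below tags each word with a direction suffix. The `ToyExample`, `ToyTriple`, and `ToyCorpus` names, the `_FWD`/`_REV` suffixes, and the two toy sentences are hypothetical stand-ins invented so that the cell runs on its own; they are not part of `rel_ext`. Only the `middle` attribute and the `get_examples_for_entities()` call mirror what is used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter, namedtuple\n",
    "\n",
    "# Hypothetical stand-ins for illustration only; not part of rel_ext.\n",
    "ToyExample = namedtuple('ToyExample', 'entity_1 entity_2 middle')\n",
    "ToyTriple = namedtuple('ToyTriple', 'rel sbj obj')\n",
    "\n",
    "class ToyCorpus:\n",
    "    def __init__(self, examples):\n",
    "        self.examples = examples\n",
    "\n",
    "    def get_examples_for_entities(self, e1, e2):\n",
    "        # Return examples whose entities appear in exactly this order.\n",
    "        return [ex for ex in self.examples\n",
    "                if ex.entity_1 == e1 and ex.entity_2 == e2]\n",
    "\n",
    "def directional_bag_of_words_featurizer(kbt, corpus, feature_counter):\n",
    "    # Same signature as simple_bag_of_words_featurizer, but each word is\n",
    "    # suffixed with the direction of the example it came from.\n",
    "    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word + '_FWD'] += 1\n",
    "    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):\n",
    "        for word in ex.middle.split(' '):\n",
    "            feature_counter[word + '_REV'] += 1\n",
    "    return feature_counter\n",
    "\n",
    "toy_corpus = ToyCorpus([\n",
    "    ToyExample('A', 'B', 'is the capital of'),\n",
    "    ToyExample('B', 'A', 'has its capital at')])\n",
    "toy_features = directional_bag_of_words_featurizer(\n",
    "    ToyTriple('capital', 'A', 'B'), toy_corpus, Counter())\n",
    "print(toy_features)"
   ]
  },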
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can experiment with adding new kinds of features just by implementing additional featurizers, following `simple_bag_of_words_featurizer` as an example.\n",
    "\n",
    "To apply machine learning algorithms such as those provided by `sklearn`, we need a way to convert datasets of `KBTriple`s into feature matrices. The following steps do that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "kbts_by_rel, labels_by_rel = dataset.build_dataset()\n",
    "\n",
    "featurized = dataset.featurize(kbts_by_rel, featurizers=[simple_bag_of_words_featurizer])"
   ]
  },
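  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of a feature matrix concrete, here is a minimal pure-Python sketch of the general technique: each distinct feature name gets a column, and each feature counter becomes a row of counts. This is a simplified stand-in, not the code inside `dataset.featurize()`; the function name `toy_vectorize` and the toy counters are invented for the illustration. In practice a sparse vectorizer (for example, `sklearn`'s `DictVectorizer`) would do this job far more efficiently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def toy_vectorize(feature_counters):\n",
    "    # Assign one column per distinct feature name, in sorted order.\n",
    "    vocab = sorted({feat for fc in feature_counters for feat in fc})\n",
    "    index = {feat: i for i, feat in enumerate(vocab)}\n",
    "    # Build one dense row of counts per feature counter.\n",
    "    matrix = []\n",
    "    for fc in feature_counters:\n",
    "        row = [0] * len(vocab)\n",
    "        for feat, count in fc.items():\n",
    "            row[index[feat]] = count\n",
    "        matrix.append(row)\n",
    "    return vocab, matrix\n",
    "\n",
    "# Hypothetical feature counters, one per (imaginary) KB triple.\n",
    "toy_counters = [\n",
    "    Counter({'born': 1, 'in': 2}),\n",
    "    Counter({'in': 1, 'wife': 1})]\n",
    "toy_vocab, toy_matrix = toy_vectorize(toy_counters)\n",
    "print(toy_vocab)\n",
    "print(toy_matrix)"
   ]
  },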
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Experiments\n",
    "\n",
    "Now we need some functions to train models, make predictions, and evaluate the results. We'll start with `train_models()`. This function takes as arguments a dictionary of data splits, a list of featurizers, the name of the split on which to train (by default, `'train'`), and a model factory: a function that initializes an `sklearn` classifier (by default, a logistic regression classifier). It returns a dictionary holding the featurizers, the vectorizer that was used to generate the training matrix, and a dictionary holding the trained models, one per relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_result = rel_ext.train_models(\n",
    "    splits,\n",
    "    featurizers=[simple_bag_of_words_featurizer])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next comes `predict()`. This function takes as arguments a dictionary of data splits, the outputs of `train_models()`, and the name of the split for which to make predictions. It returns two parallel dictionaries: one holding the predictions (grouped by relation), the other holding the true labels (again, grouped by relation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions, true_labels = rel_ext.predict(\n",
    "    splits, train_result, split_name='dev')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now `evaluate_predictions()`. This function takes as arguments the parallel dictionaries of predictions and true labels produced by `predict()`. It prints summary statistics for each relation, including precision, recall, and F<sub>0.5</sub>-score, and it returns the macro-averaged F<sub>0.5</sub>-score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.832      0.378      0.671        407       7057\n",
      "author                    0.779      0.525      0.710        657       7307\n",
      "capital                   0.638      0.294      0.517        126       6776\n",
      "contains                  0.783      0.608      0.740       4487      11137\n",
      "film_performance          0.796      0.591      0.745        984       7634\n",
      "founders                  0.783      0.384      0.648        469       7119\n",
      "genre                     0.654      0.166      0.412        205       6855\n",
      "has_sibling               0.865      0.246      0.576        625       7275\n",
      "has_spouse                0.878      0.342      0.668        754       7404\n",
      "is_a                      0.731      0.238      0.517        618       7268\n",
      "nationality               0.555      0.171      0.383        386       7036\n",
      "parents                   0.862      0.544      0.771        390       7040\n",
      "place_of_birth            0.637      0.206      0.449        282       6932\n",
      "place_of_death            0.512      0.100      0.282        209       6859\n",
      "profession                0.716      0.205      0.477        308       6958\n",
      "worked_at                 0.688      0.254      0.513        303       6953\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.732      0.328      0.567      11210     117610\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.5674055479292028"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rel_ext.evaluate_predictions(predictions, true_labels)"
   ]
  },
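  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the table above, the sketch below recomputes F<sub>0.5</sub> from precision and recall using the standard F<sub>β</sub> formula, (1 + β²)·P·R / (β²·P + R), with β = 0.5. It also averages the per-relation F-scores to reproduce the macro-average row. The function name `f_beta` is chosen for this illustration and is not the implementation used by `rel_ext`; the inputs are the rounded values printed above, so the results should agree with the table only up to rounding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def f_beta(precision, recall, beta=0.5):\n",
    "    # beta < 1 weights precision more heavily than recall.\n",
    "    if precision == 0 and recall == 0:\n",
    "        return 0.0\n",
    "    b2 = beta ** 2\n",
    "    return (1 + b2) * precision * recall / (b2 * precision + recall)\n",
    "\n",
    "# Rounded precision and recall for 'adjoins' and 'author' from the table.\n",
    "print(round(f_beta(0.832, 0.378), 3))\n",
    "print(round(f_beta(0.779, 0.525), 3))\n",
    "\n",
    "# Per-relation F-scores copied from the table; their unweighted mean\n",
    "# corresponds to the macro-average row.\n",
    "table_f_scores = [\n",
    "    0.671, 0.710, 0.517, 0.740, 0.745, 0.648, 0.412, 0.576,\n",
    "    0.668, 0.517, 0.383, 0.771, 0.449, 0.282, 0.477, 0.513]\n",
    "toy_macro_f = sum(table_f_scores) / len(table_f_scores)\n",
    "print(round(toy_macro_f, 3))"
   ]
  },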
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we introduce `rel_ext.experiment()`, which chains together `rel_ext.train_models()`, `rel_ext.predict()`, and `rel_ext.evaluate_predictions()`. For convenience, this function returns the output of `rel_ext.train_models()` as its result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running `rel_ext.experiment()` in its default configuration will give us a baseline result for machine-learned models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "relation              precision     recall    f-score    support       size\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "adjoins                   0.832      0.378      0.671        407       7057\n",
      "author                    0.779      0.525      0.710        657       7307\n",
      "capital                   0.638      0.294      0.517        126       6776\n",
      "contains                  0.783      0.608      0.740       4487      11137\n",
      "film_performance          0.796      0.591      0.745        984       7634\n",
      "founders                  0.783      0.384      0.648        469       7119\n",
      "genre                     0.654      0.166      0.412        205       6855\n",
      "has_sibling               0.865      0.246      0.576        625       7275\n",
      "has_spouse                0.878      0.342      0.668        754       7404\n",
      "is_a                      0.731      0.238      0.517        618       7268\n",
      "nationality               0.555      0.171      0.383        386       7036\n",
      "parents                   0.862      0.544      0.771        390       7040\n",
      "place_of_birth            0.637      0.206      0.449        282       6932\n",
      "place_of_death            0.512      0.100      0.282        209       6859\n",
      "profession                0.716      0.205      0.477        308       6958\n",
      "worked_at                 0.688      0.254      0.513        303       6953\n",
      "------------------    ---------  ---------  ---------  ---------  ---------\n",
      "macro-average             0.732      0.328      0.567      11210     117610\n"
     ]
    }
   ],
   "source": [
    "_ = rel_ext.experiment(\n",
    "    splits,\n",
    "    featurizers=[simple_bag_of_words_featurizer])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Considering how vanilla our model is, these results are surprisingly good! We see large gains for every relation over our `top_k_middles_classifier` from [the previous notebook](rel_ext_01_task.ipynb#A-simple-baseline-model). This strong performance is a testament to the effectiveness of even the simplest forms of machine learning.\n",
    "\n",
    "But there is still much more we can do. To make further gains, we must not treat the model as a black box. We must open it up and get visibility into what it has learned, and more importantly, where it still falls down."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examining the trained models\n",
    "\n",
    "One important way to gain understanding of our trained model is to inspect the model weights. What features are strong positive indicators for each relation, and what features are strong negative indicators?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Highest and lowest feature weights for relation adjoins:\n",
      "\n",
      "     2.511 Córdoba\n",
      "     2.467 Taluks\n",
      "     2.434 Valais\n",
      "     ..... .....\n",
      "    -1.143 for\n",
      "    -1.186 Egypt\n",
      "    -1.277 America\n",
      "\n",
      "Highest and lowest feature weights for relation author:\n",
      "\n",
      "     3.055 author\n",
      "     3.032 books\n",
      "     2.342 by\n",
      "     ..... .....\n",
      "    -2.002 directed\n",
      "    -2.019 or\n",
      "    -2.211 poetry\n",
      "\n",
      "Highest and lowest feature weights for relation capital:\n",
      "\n",
      "     3.922 capital\n",
      "     2.163 especially\n",
      "     2.155 city\n",
      "     ..... .....\n",
      "    -1.238 and\n",
      "    -1.263 being\n",
      "    -1.959 borough\n",
      "\n",
      "Highest and lowest feature weights for relation contains:\n",
      "\n",
      "     2.768 bordered\n",
      "     2.716 third-largest\n",
      "     2.219 tiny\n",
      "     ..... .....\n",
      "    -3.502 Midlands\n",
      "    -3.954 Siege\n",
      "    -3.969 destroyed\n",
      "\n",
      "Highest and lowest feature weights for relation film_performance:\n",
      "\n",
      "     4.004 starring\n",
      "     3.731 alongside\n",
      "     3.199 opposite\n",
      "     ..... .....\n",
      "    -1.702 then\n",
      "    -1.840 She\n",
      "    -1.889 Genghis\n",
      "\n",
      "Highest and lowest feature weights for relation founders:\n",
      "\n",
      "     3.677 founded\n",
      "     3.276 founder\n",
      "     2.779 label\n",
      "     ..... .....\n",
      "    -1.795 William\n",
      "    -1.850 Griffith\n",
      "    -1.854 Wilson\n",
      "\n",
      "Highest and lowest feature weights for relation genre:\n",
      "\n",
      "     3.092 series\n",
      "     2.800 game\n",
      "     2.622 album\n",
      "     ..... .....\n",
      "    -1.296 animated\n",
      "    -1.434 and\n",
      "    -1.949 at\n",
      "\n",
      "Highest and lowest feature weights for relation has_sibling:\n",
      "\n",
      "     5.196 brother\n",
      "     3.933 sister\n",
      "     2.747 nephew\n",
      "     ..... .....\n",
      "    -1.293 '\n",
      "    -1.312 from\n",
      "    -1.437 including\n",
      "\n",
      "Highest and lowest feature weights for relation has_spouse:\n",
      "\n",
      "     5.319 wife\n",
      "     4.652 married\n",
      "     4.617 husband\n",
      "     ..... .....\n",
      "    -1.528 between\n",
      "    -1.559 MTV\n",
      "    -1.599 Terri\n",
      "\n",
      "Highest and lowest feature weights for relation is_a:\n",
      "\n",
      "     3.182 family\n",
      "     2.898 philosopher\n",
      "     2.623 \n",
      "     ..... .....\n",
      "    -1.411 now\n",
      "    -1.441 beans\n",
      "    -1.618 at\n",
      "\n",
      "Highest and lowest feature weights for relation nationality:\n",
      "\n",
      "     2.887 born\n",
      "     1.933 president\n",
      "     1.843 caliph\n",
      "     ..... .....\n",
      "    -1.467 or\n",
      "    -1.540 ;\n",
      "    -1.729 American\n",
      "\n",
      "Highest and lowest feature weights for relation parents:\n",
      "\n",
      "     5.108 son\n",
      "     4.437 father\n",
      "     4.400 daughter\n",
      "     ..... .....\n",
      "    -1.053 a\n",
      "    -1.070 England\n",
      "    -1.210 in\n",
      "\n",
      "Highest and lowest feature weights for relation place_of_birth:\n",
      "\n",
      "     3.980 born\n",
      "     2.843 birthplace\n",
      "     2.702 mayor\n",
      "     ..... .....\n",
      "    -1.276 Mughal\n",
      "    -1.392 or\n",
      "    -1.426 and\n",
      "\n",
      "Highest and lowest feature weights for relation place_of_death:\n",
      "\n",
      "     2.161 assassinated\n",
      "     2.027 died\n",
      "     1.837 Germany\n",
      "     ..... .....\n",
      "    -1.246 ;\n",
      "    -1.256 as\n",
      "    -1.474 Siege\n",
      "\n",
      "Highest and lowest feature weights for relation profession:\n",
      "\n",
      "     3.148 \n",
      "     2.727 American\n",
      "     2.635 philosopher\n",
      "     ..... .....\n",
      "    -1.212 at\n",
      "    -1.348 in\n",
      "    -1.986 on\n",
      "\n",
      "Highest and lowest feature weights for relation worked_at:\n",
      "\n",
      "     3.107 president\n",
      "     2.913 head\n",
      "     2.743 professor\n",
      "     ..... .....\n",
      "    -1.134 province\n",
      "    -1.150 author\n",
      "    -1.714 or\n",
      "\n"
     ]
    }
   ],
   "source": [
    "rel_ext.examine_model_weights(train_result)"
   ]
  },
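  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A listing like the one above comes down to sorting features by their learned weights. The sketch below shows that idea on a small dictionary holding the six `author` weights printed above. The function `top_and_bottom_features` and the hand-copied dictionary are illustrative assumptions, not the implementation of `rel_ext.examine_model_weights()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def top_and_bottom_features(weights, k=3):\n",
    "    # Sort feature names from highest to lowest weight.\n",
    "    ranked = sorted(weights, key=weights.get, reverse=True)\n",
    "    return ranked[:k], ranked[-k:]\n",
    "\n",
    "# Weights copied from the printed output for the 'author' relation.\n",
    "toy_author_weights = {\n",
    "    'author': 3.055, 'books': 3.032, 'by': 2.342,\n",
    "    'directed': -2.002, 'or': -2.019, 'poetry': -2.211}\n",
    "toy_top, toy_bottom = top_and_bottom_features(toy_author_weights, k=2)\n",
    "print(toy_top)\n",
    "print(toy_bottom)"
   ]
  },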
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By and large, the high-weight features for each relation are pretty intuitive: they are words that are used to express the relation in question. (The counter-intuitive results merit a bit of investigation!)\n",
    "\n",
    "The low-weight features (that is, features with large negative weights) may be a bit harder to understand. In some cases, however, they can be interpreted as features that indicate some _other_ relation that is anti-correlated with the target relation. (As an example, \"directed\" is a negative indicator for the `author` relation.)\n",
    "\n",
    "__Optional exercise:__ Investigate one of the counter-intuitive high-weight features. Find the training examples which caused the feature to be included. Given the training data, does it make sense that this feature is a good predictor for the target relation?\n",
    "\n",
    "<!--\n",
    "- SPOILER: Using `penalty='l1'` results in somewhat less intuitive feature weights, and about the same performance.\n",
    "- SPOILER: Using `penalty='l1', C=0.1` results in much more intuitive feature weights, but much worse performance.\n",
    "-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discovering new relation instances\n",
    "\n",
    "Another way to gain insight into our trained models is to use them to discover new relation instances that don't currently appear in the KB. In fact, this is the whole point of building a relation extraction system: to extend an existing KB (or build a new one) using knowledge extracted from natural language text at scale. Can the models we've trained do this effectively?\n",
    "\n",
    "Because the goal is to discover new relation instances which are *true* but *absent from the KB*, we can't evaluate this capability automatically. But we can generate candidate KB triples and manually evaluate them for correctness.\n",
    "\n",
    "To do this, we'll start from corpus examples containing pairs of entities which do not belong to any relation in the KB (earlier, we described these as \"negative examples\"). We'll then apply our trained models to each pair of entities, and sort the results by probability assigned by the model, in order to find the most likely new instances for each relation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Highest probability examples for relation adjoins:\n",
      "\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Canada', obj='Vancouver')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Vancouver', obj='Canada')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Australia', obj='Sydney')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Sydney', obj='Australia')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Mexico', obj='Atlantic_Ocean')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Atlantic_Ocean', obj='Mexico')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Dubai', obj='United_Arab_Emirates')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='United_Arab_Emirates', obj='Dubai')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='Sydney', obj='New_South_Wales')\n",
      "     1.000 KBTriple(rel='adjoins', sbj='New_South_Wales', obj='Sydney')\n",
      "\n",
      "Highest probability examples for relation author:\n",
      "\n",
      "     1.000 KBTriple(rel='author', sbj='Oliver_Twist', obj='Charles_Dickens')\n",
      "     1.000 KBTriple(rel='author', sbj='Jane_Austen', obj='Pride_and_Prejudice')\n",
      "     1.000 KBTriple(rel='author', sbj='Iliad', obj='Homer')\n",
      "     1.000 KBTriple(rel='author', sbj='Divine_Comedy', obj='Dante_Alighieri')\n",
      "     1.000 KBTriple(rel='author', sbj='Pride_and_Prejudice', obj='Jane_Austen')\n",
      "     1.000 KBTriple(rel='author', sbj=\"Euclid's_Elements\", obj='Euclid')\n",
      "     1.000 KBTriple(rel='author', sbj='Aldous_Huxley', obj='The_Doors_of_Perception')\n",
      "     1.000 KBTriple(rel='author', sbj=\"Uncle_Tom's_Cabin\", obj='Harriet_Beecher_Stowe')\n",
      "     1.000 KBTriple(rel='author', sbj='Ray_Bradbury', obj='Fahrenheit_451')\n",
      "     1.000 KBTriple(rel='author', sbj='A_Christmas_Carol', obj='Charles_Dickens')\n",
      "\n",
      "Highest probability examples for relation capital:\n",
      "\n",
      "     1.000 KBTriple(rel='capital', sbj='Delhi', obj='India')\n",
      "     1.000 KBTriple(rel='capital', sbj='Bangladesh', obj='Dhaka')\n",
      "     1.000 KBTriple(rel='capital', sbj='India', obj='Delhi')\n",
      "     1.000 KBTriple(rel='capital', sbj='Lucknow', obj='Uttar_Pradesh')\n",
      "     1.000 KBTriple(rel='capital', sbj='Chengdu', obj='Sichuan')\n",
      "     1.000 KBTriple(rel='capital', sbj='Dhaka', obj='Bangladesh')\n",
      "     1.000 KBTriple(rel='capital', sbj='Uttar_Pradesh', obj='Lucknow')\n",
      "     1.000 KBTriple(rel='capital', sbj='Sichuan', obj='Chengdu')\n",
      "     1.000 KBTriple(rel='capital', sbj='Bandung', obj='West_Java')\n",
      "     1.000 KBTriple(rel='capital', sbj='West_Java', obj='Bandung')\n",
      "\n",
      "Highest probability examples for relation contains:\n",
      "\n",
      "     1.000 KBTriple(rel='contains', sbj='Delhi', obj='India')\n",
      "     1.000 KBTriple(rel='contains', sbj='Dubai', obj='United_Arab_Emirates')\n",
      "     1.000 KBTriple(rel='contains', sbj='Campania', obj='Naples')\n",
      "     1.000 KBTriple(rel='contains', sbj='India', obj='Uttarakhand')\n",
      "     1.000 KBTriple(rel='contains', sbj='Bangladesh', obj='Dhaka')\n",
      "     1.000 KBTriple(rel='contains', sbj='India', obj='Delhi')\n",
      "     1.000 KBTriple(rel='contains', sbj='Uttarakhand', obj='India')\n",
      "     1.000 KBTriple(rel='contains', sbj='Australia', obj='Melbourne')\n",
      "     1.000 KBTriple(rel='contains', sbj='Palawan', obj='Philippines')\n",
      "     1.000 KBTriple(rel='contains', sbj='Canary_Islands', obj='Tenerife')\n",
      "\n",
      "Highest probability examples for relation film_performance:\n",
      "\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Amitabh_Bachchan', obj='Mohabbatein')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Mohabbatein', obj='Amitabh_Bachchan')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='A_Christmas_Carol', obj='Charles_Dickens')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Charles_Dickens', obj='A_Christmas_Carol')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='De-Lovely', obj='Kevin_Kline')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Kevin_Kline', obj='De-Lovely')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Akshay_Kumar', obj='Sonakshi_Sinha')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Sonakshi_Sinha', obj='Akshay_Kumar')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Iliad', obj='Homer')\n",
      "     1.000 KBTriple(rel='film_performance', sbj='Homer', obj='Iliad')\n",
      "\n",
      "Highest probability examples for relation founders:\n",
      "\n",
      "     1.000 KBTriple(rel='founders', sbj='Iliad', obj='Homer')\n",
      "     1.000 KBTriple(rel='founders', sbj='Homer', obj='Iliad')\n",
      "     1.000 KBTriple(rel='founders', sbj='William_C._Durant', obj='Louis_Chevrolet')\n",
      "     1.000 KBTriple(rel='founders', sbj='Louis_Chevrolet', obj='William_C._Durant')\n",
      "     1.000 KBTriple(rel='founders', sbj='Mongol_Empire', obj='Genghis_Khan')\n",
      "     1.000 KBTriple(rel='founders', sbj='Genghis_Khan', obj='Mongol_Empire')\n",
      "     1.000 KBTriple(rel='founders', sbj='Elon_Musk', obj='SpaceX')\n",
      "     1.000 KBTriple(rel='founders', sbj='SpaceX', obj='Elon_Musk')\n",
      "     1.000 KBTriple(rel='founders', sbj='Marvel_Comics', obj='Stan_Lee')\n",
      "     1.000 KBTriple(rel='founders', sbj='Stan_Lee', obj='Marvel_Comics')\n",
      "\n",
      "Highest probability examples for relation genre:\n",
      "\n",
      "     1.000 KBTriple(rel='genre', sbj='Oliver_Twist', obj='Charles_Dickens')\n",
      "     1.000 KBTriple(rel='genre', sbj='Charles_Dickens', obj='Oliver_Twist')\n",
      "     0.999 KBTriple(rel='genre', sbj='Mark_Twain_Tonight', obj='Hal_Holbrook')\n",
      "     0.999 KBTriple(rel='genre', sbj='Hal_Holbrook', obj='Mark_Twain_Tonight')\n",
      "     0.997 KBTriple(rel='genre', sbj='The_Dark_Side_of_the_Moon', obj='Pink_Floyd')\n",
      "     0.997 KBTriple(rel='genre', sbj='Pink_Floyd', obj='The_Dark_Side_of_the_Moon')\n",
      "     0.991 KBTriple(rel='genre', sbj='Andrew_Garfield', obj='Sam_Raimi')\n",
      "     0.991 KBTriple(rel='genre', sbj='Sam_Raimi', obj='Andrew_Garfield')\n",
      "     0.981 KBTriple(rel='genre', sbj='Life_of_Pi', obj='Man_Booker_Prize')\n",
      "     0.981 KBTriple(rel='genre', sbj='Man_Booker_Prize', obj='Life_of_Pi')\n",
      "\n",
      "Highest probability examples for relation has_sibling:\n",
      "\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Jess_Margera', obj='April_Margera')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='April_Margera', obj='Jess_Margera')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Lincoln_Borglum', obj='Gutzon_Borglum')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Gutzon_Borglum', obj='Lincoln_Borglum')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Rufus_Wainwright', obj='Kate_McGarrigle')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Kate_McGarrigle', obj='Rufus_Wainwright')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Philip_II_of_Macedon', obj='Alexander_the_Great')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Alexander_the_Great', obj='Philip_II_of_Macedon')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Ronald_Goldman', obj='Nicole_Brown_Simpson')\n",
      "     1.000 KBTriple(rel='has_sibling', sbj='Nicole_Brown_Simpson', obj='Ronald_Goldman')\n",
      "\n",
      "Highest probability examples for relation has_spouse:\n",
      "\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Akhenaten', obj='Tutankhamun')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Tutankhamun', obj='Akhenaten')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='United_Artists', obj='Douglas_Fairbanks')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Douglas_Fairbanks', obj='United_Artists')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Ronald_Goldman', obj='Nicole_Brown_Simpson')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Nicole_Brown_Simpson', obj='Ronald_Goldman')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='William_C._Durant', obj='Louis_Chevrolet')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Louis_Chevrolet', obj='William_C._Durant')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='England', obj='Charles_II_of_England')\n",
      "     1.000 KBTriple(rel='has_spouse', sbj='Charles_II_of_England', obj='England')\n",
      "\n",
      "Highest probability examples for relation is_a:\n",
      "\n",
      "     1.000 KBTriple(rel='is_a', sbj='Canada', obj='Vancouver')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Felidae', obj='Panthera')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Panthera', obj='Felidae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Vancouver', obj='Canada')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Phasianidae', obj='Bird')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Bird', obj='Phasianidae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Accipitridae', obj='Bird')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Bird', obj='Accipitridae')\n",
      "     1.000 KBTriple(rel='is_a', sbj='Automobile', obj='South_Korea')\n",
      "     1.000 KBTriple(rel='is_a', sbj='South_Korea', obj='Automobile')\n",
      "\n",
      "Highest probability examples for relation nationality:\n",
      "\n",
      "     1.000 KBTriple(rel='nationality', sbj='Titus', obj='Roman_Empire')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Roman_Empire', obj='Titus')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Philip_II_of_Macedon', obj='Alexander_the_Great')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Alexander_the_Great', obj='Philip_II_of_Macedon')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Mongol_Empire', obj='Genghis_Khan')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Genghis_Khan', obj='Mongol_Empire')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Norodom_Sihamoni', obj='Cambodia')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Cambodia', obj='Norodom_Sihamoni')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Tamil_Nadu', obj='Ramanathapuram_district')\n",
      "     1.000 KBTriple(rel='nationality', sbj='Ramanathapuram_district', obj='Tamil_Nadu')\n",
      "\n",
      "Highest probability examples for relation parents:\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     1.000 KBTriple(rel='parents', sbj='Philip_II_of_Macedon', obj='Alexander_the_Great')\n",
      "     1.000 KBTriple(rel='parents', sbj='Lincoln_Borglum', obj='Gutzon_Borglum')\n",
      "     1.000 KBTriple(rel='parents', sbj='Gutzon_Borglum', obj='Lincoln_Borglum')\n",
      "     1.000 KBTriple(rel='parents', sbj='Alexander_the_Great', obj='Philip_II_of_Macedon')\n",
      "     1.000 KBTriple(rel='parents', sbj='Anne_Boleyn', obj='Thomas_Boleyn,_1st_Earl_of_Wiltshire')\n",
      "     1.000 KBTriple(rel='parents', sbj='Thomas_Boleyn,_1st_Earl_of_Wiltshire', obj='Anne_Boleyn')\n",
      "     1.000 KBTriple(rel='parents', sbj='Jess_Margera', obj='April_Margera')\n",
      "     1.000 KBTriple(rel='parents', sbj='April_Margera', obj='Jess_Margera')\n",
      "     1.000 KBTriple(rel='parents', sbj='Saddam_Hussein', obj='Uday_Hussein')\n",
      "     1.000 KBTriple(rel='parents', sbj='Uday_Hussein', obj='Saddam_Hussein')\n",
      "\n",
      "Highest probability examples for relation place_of_birth:\n",
      "\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='Lucknow', obj='Uttar_Pradesh')\n",
      "     1.000 KBTriple(rel='place_of_birth', sbj='Uttar_Pradesh', obj='Lucknow')\n",
      "     0.999 KBTriple(rel='place_of_birth', sbj='Philip_II_of_Macedon', obj='Alexander_the_Great')\n",
      "     0.999 KBTriple(rel='place_of_birth', sbj='Alexander_the_Great', obj='Philip_II_of_Macedon')\n",
      "     0.999 KBTriple(rel='place_of_birth', sbj='Nepal', obj='Bagmati_Zone')\n",
      "     0.999 KBTriple(rel='place_of_birth', sbj='Bagmati_Zone', obj='Nepal')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='Chengdu', obj='Sichuan')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='Sichuan', obj='Chengdu')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='San_Antonio', obj='Actor')\n",
      "     0.998 KBTriple(rel='place_of_birth', sbj='Actor', obj='San_Antonio')\n",
      "\n",
      "Highest probability examples for relation place_of_death:\n",
      "\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Philip_II_of_Macedon', obj='Alexander_the_Great')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Alexander_the_Great', obj='Philip_II_of_Macedon')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Titus', obj='Roman_Empire')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Roman_Empire', obj='Titus')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Lucknow', obj='Uttar_Pradesh')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Uttar_Pradesh', obj='Lucknow')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Chengdu', obj='Sichuan')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Sichuan', obj='Chengdu')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Roman_Empire', obj='Trajan')\n",
      "     1.000 KBTriple(rel='place_of_death', sbj='Trajan', obj='Roman_Empire')\n",
      "\n",
      "Highest probability examples for relation profession:\n",
      "\n",
      "     1.000 KBTriple(rel='profession', sbj='Canada', obj='Vancouver')\n",
      "     1.000 KBTriple(rel='profession', sbj='Vancouver', obj='Canada')\n",
      "     1.000 KBTriple(rel='profession', sbj='Little_Women', obj='Louisa_May_Alcott')\n",
      "     1.000 KBTriple(rel='profession', sbj='Louisa_May_Alcott', obj='Little_Women')\n",
      "     0.999 KBTriple(rel='profession', sbj='Aldous_Huxley', obj='Eyeless_in_Gaza')\n",
      "     0.999 KBTriple(rel='profession', sbj='Eyeless_in_Gaza', obj='Aldous_Huxley')\n",
      "     0.999 KBTriple(rel='profession', sbj='Jess_Margera', obj='April_Margera')\n",
      "     0.999 KBTriple(rel='profession', sbj='April_Margera', obj='Jess_Margera')\n",
      "     0.999 KBTriple(rel='profession', sbj='Actor', obj='Screenwriter')\n",
      "     0.999 KBTriple(rel='profession', sbj='Screenwriter', obj='Actor')\n",
      "\n",
      "Highest probability examples for relation worked_at:\n",
      "\n",
      "     1.000 KBTriple(rel='worked_at', sbj='William_C._Durant', obj='Louis_Chevrolet')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Louis_Chevrolet', obj='William_C._Durant')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Iliad', obj='Homer')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Homer', obj='Iliad')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Marvel_Comics', obj='Stan_Lee')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Stan_Lee', obj='Marvel_Comics')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Mongol_Empire', obj='Genghis_Khan')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Genghis_Khan', obj='Mongol_Empire')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Comic_book', obj='Marvel_Comics')\n",
      "     1.000 KBTriple(rel='worked_at', sbj='Marvel_Comics', obj='Comic_book')\n",
      "\n"
     ]
    }
   ],
   "source": [
    "rel_ext.find_new_relation_instances(\n",
    "    dataset,\n",
    "    featurizers=[simple_bag_of_words_featurizer])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some genuinely good discoveries here! The predictions for the `author` relation seem especially good. Of course, there are also plenty of bad results, and a few that are downright comical: the model is confident, for instance, that `England` and `Charles_II_of_England` are spouses. We can hope that, as we improve our models and optimize performance in our automatic evaluations, the results we observe in this manual evaluation will improve as well.\n",
    "\n",
    "__Optional exercise:__ Note that every time we predict that a given relation holds between entities `X` and `Y`, we also predict, with equal confidence, that it holds between `Y` and `X`. Why? How could we fix this?\n",
    "\n",
    "\\[ [top](#Relation-extraction-using-distant-supervision:-experiments) \\]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
